[Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc #7777
ShaneGZhu wants to merge 7 commits into
Conversation
Thanks for your contribution!
CI report generated from the following code (updated every 30 minutes):
1 Task overview
2 Task status summary
2.1 Required tasks: 2/10 passed
2.2 Optional tasks: 24/28 passed
3 Failure details (required only)
Approval — process/approval issue (confidence: high)
Root cause details:
Neither required approval is currently satisfied; the script reports "There are 2 approved errors." and exits with exit code 6.
Key logs: see the linked log.
Suggested fix: request one review approval each from an FD RD and a PaddlePaddle RD. Link: view logs
```python
from fastdeploy.model_executor.layers.moe.fused_cast_sigmoid_bias import (
    fused_cast_sigmoid_bias,
)
pass
```
Codecov Report
❌ Patch coverage is …
Additional details and impacted files:
@@ Coverage Diff @@
## develop #7777 +/- ##
==========================================
Coverage ? 63.15%
==========================================
Files ? 461
Lines ? 64138
Branches ? 9824
==========================================
Hits ? 40505
Misses ? 20851
Partials ? 2782
…ontrol whether to use the kernel-fused path.
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-13 18:36:14
📋 Review Summary
PR overview: adds a grouped_topk CUDA fused kernel that folds the four steps cast + sigmoid + bias + noaux_tc into a single kernel launch, replacing the original two-kernel fused_cast_sigmoid_bias + noaux_tc path, gated behind the new flag enable_moe_scores_elementwise_fuse.
Scope of changes: custom_ops/gpu_ops/ (new CUDA kernel), fastdeploy/model_executor/layers/moe/ (three backends + moe.py), fastdeploy/engine/args_utils.py, fastdeploy/scheduler/config.py
Impact tags: [OP] [Optimization]
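For orientation, the following is a minimal sketch of the dispatch described above; the grouped_topk / noaux_tc signatures and the layer attribute names are assumptions for illustration, not the PR's actual code.

```python
# Illustrative only: op signatures and attribute names below are assumed;
# the real integration lives in moe.py and the three MoE backends.
def get_moe_scores(gating_output, layer, use_fused):
    if use_fused:
        # Single fused kernel launch: cast + sigmoid + bias + noaux_tc grouped top-k.
        return grouped_topk(
            gating_output,
            layer.gate_correction_bias,
            layer.n_group,
            layer.topk_group,
            layer.top_k,
        )
    # Original path: elementwise ops in Paddle, then the noaux_tc routing op.
    scores = paddle.nn.functional.sigmoid(gating_output.cast("float32"))
    scores = scores + layer.gate_correction_bias
    return noaux_tc(scores, layer.n_group, layer.topk_group, layer.top_k)
```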
📝 PR Convention Check
The title [Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc has three issues: ① it contains two tags (the convention allows only one); ② [Op] uses the wrong casing (the official list uses [OP]); ③ there is no space between the tag and the description.
Suggested title (copy-paste ready):
[OP] Kernel fusion: cast+sigmoid+bias+noauxtc
Issues
| Severity | File | Summary |
|---|---|---|
| 📝 PR convention | PR title | Two tags, wrong [Op] casing, missing space after the tag |
| ❓ Question | fastdeploy/engine/args_utils.py:344 | Defaulting to False removes the existing fused_cast_sigmoid_bias optimization; existing deployments regress in performance |
| 🟡 Suggestion | fastdeploy/model_executor/layers/moe/moe.py:135 | With use_fused_cast=True on the redundant EP path, gating_output is fed to noaux_tc_redundant without being cast to float32 |
| ❓ Question | fastdeploy/model_executor/layers/moe/fused_moe_cutlass_backend.py:363 | The FD_ENABLE_RL guard is removed; behavior changes when the fuse flag is enabled in RL mode |
Overall Assessment
The CUDA kernel implementation (BitonicSort / WarpSelect / Phase 1 & 2) is logically complete, all three backends are updated in step, and the tests cover four typical model configurations; overall quality is good. The main concerns are the default-value strategy and type safety on the edge paths; recommend confirming these before merging.
```python
Chunk size of moe input.
"""

enable_moe_scores_elementwise_fuse: bool = False
```
❓ Question: With enable_moe_scores_elementwise_fuse defaulting to False, the previously default-enabled fused_cast_sigmoid_bias optimization is removed.
Before this PR, non-RL CUDA deployments defaulted to fused_cast_sigmoid_bias + noaux_tc (roughly +2% TPS); after merging, the default falls back to pure Python sigmoid + noaux_tc (equivalent to the baseline).
Existing deployments will silently lose the previous performance gain without changing any configuration. Recommend documenting the migration path in the help text or Release Notes, or evaluating whether to enable the flag by default on CUDA.
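If the new field is surfaced through the engine arguments as this comment assumes, existing deployments could opt back into the fused path explicitly. The import path and the other constructor arguments below are assumptions for illustration, not confirmed API.

```python
# Assumed usage sketch: the field name comes from the diff above; the
# EngineArgs import path and its other parameters are illustrative only.
from fastdeploy.engine.args_utils import EngineArgs

args = EngineArgs(
    model="./DeepSeek-V3",                    # any MoE model using noaux_tc routing
    enable_moe_scores_elementwise_fuse=True,  # opt back into the fused scores path
)
```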
```python
renormalize,
routed_scaling_factor,
)
else:
```
🟡 Suggestion: When use_fused_cast=True and expert_id_to_ep_rank_array is not None (the redundant EP path), the callers (cutlass/triton backends) do not pre-cast gate_out to float32 (the `if not use_fused: gate_out = gate_out.cast("float32")` step is skipped), yet this else branch applies sigmoid directly to the bfloat16/float16 gating_output and then adds the float32 e_score_correction_bias, which may trigger an implicit cast or a dtype mismatch inside noaux_tc_redundant.
Recommend adding a float32 cast fallback at the entry of this else branch:

```python
else:
    if gating_output.dtype != paddle.float32:
        gating_output = gating_output.cast("float32")
    scores = paddle.nn.functional.sigmoid(gating_output)
```

```python
if fastdeploy.envs.FD_USE_PHI_MOE_PERMUTE and self.moe_quant_type == "w16a16":
if layer.topk_method == "noaux_tc":
use_fused = not fastdeploy.envs.FD_ENABLE_RL and current_platform.is_cuda() and not fc1_latent_proj
use_fused = (
```
❓ Question: The original not fastdeploy.envs.FD_ENABLE_RL guard has been removed. If a user enables enable_moe_scores_elementwise_fuse=True while in RL training mode (FD_ENABLE_RL=True), the fused kernel will now run in the RL scenario.
Please confirm: ① was the original RL guard intentional (i.e., the fused kernel has correctness/compatibility issues in RL mode), or is it a historical constraint that can be removed? ② If RL is incompatible with the fused path, the guard should be restored here or the limitation documented.
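If RL does turn out to be incompatible with the fused path, the guard could be restored along these lines; the condition is adapted from the diff excerpt above, and the way the flag is accessed here is an assumption.

```python
# Sketch of restoring the RL guard inside the use_fused condition
# (flag access path is illustrative, adapted from the excerpt above).
use_fused = (
    not fastdeploy.envs.FD_ENABLE_RL          # keep the fused kernel out of RL mode
    and current_platform.is_cuda()            # kernel is CUDA-only
    and not fc1_latent_proj                   # pre-existing condition, unchanged
    and enable_moe_scores_elementwise_fuse    # new opt-in flag
)
```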
Motivation
Kernel fusion: cast + sigmoid + bias + noauxtc. Currently, this is supported only on CUDA devices.
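As a reference for what the single kernel computes, below is a plain-Paddle sketch of the four fused steps. The grouped top-k part is a simplified stand-in for the noaux_tc routing (no renormalization or routed scaling factor) and is meant only to illustrate the math being fused, not the exact kernel semantics.

```python
import paddle

def reference_cast_sigmoid_bias_grouped_topk(gating_output, bias, n_group, topk_group, top_k):
    """Unfused reference for the four steps the CUDA kernel combines.

    Simplified stand-in for noaux_tc routing: only the cast / sigmoid / bias /
    grouped-top-k math, without renormalization or routed scaling.
    """
    scores = gating_output.cast("float32")            # 1) cast (bf16/fp16 -> fp32)
    scores = paddle.nn.functional.sigmoid(scores)     # 2) sigmoid
    scores_with_bias = scores + bias                  # 3) e_score correction bias
    # 4) grouped top-k: rank expert groups, keep the best `topk_group` groups,
    #    then pick the global top_k experts among the surviving groups.
    num_tokens, num_experts = scores_with_bias.shape
    grouped = scores_with_bias.reshape([num_tokens, n_group, -1])
    group_scores = grouped.topk(2, axis=-1)[0].sum(axis=-1)   # per-group score
    group_idx = group_scores.topk(topk_group, axis=-1)[1]     # surviving groups
    mask = paddle.zeros([num_tokens, n_group])
    mask = paddle.put_along_axis(
        mask, group_idx, paddle.ones_like(group_idx, dtype="float32"), axis=-1
    )
    masked = (grouped * mask.unsqueeze(-1)).reshape([num_tokens, num_experts])
    topk_weights, topk_ids = masked.topk(top_k, axis=-1)
    return topk_weights, topk_ids
```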
Modifications
- custom_ops/gpu_ops/grouped_topk_kernels.cu: implements grouped_topk_fused_kernel, which completes cast, sigmoid, bias addition, and grouped top-k routing in a single kernel launch; supports float32/bfloat16/float16 input
- custom_ops/gpu_ops/cpp_extensions.cc: adds the grouped_topk function declaration and pybind11 binding
- custom_ops/setup_ops.py: adds the new .cu file to both compilation source lists
- fastdeploy/model_executor/layers/moe/moe.py: when use_fused=True, get_moe_scores takes the new grouped_topk path, replacing the original fused_cast_sigmoid_bias + noaux_tc double call
- tests/operators/test_grouped_topk_op.py: adds correctness and numerical-alignment tests covering four model configurations: DeepSeek-V3, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2
Usage or Command
N/A
Accuracy Tests
Performance comparison: fused_cast+noaux (A) vs fused_cast_grouped_topk (C)
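The following is not the PR's test file, but a sketch of the kind of alignment check the new tests presumably perform; the fused op's import path and signature are assumptions, so its call is left as a commented placeholder.

```python
# Illustrative alignment-check skeleton; the fused op's import path and
# signature are assumptions, so its call is shown as a commented placeholder.
import paddle

num_tokens, num_experts, n_group, topk_group, top_k = 16, 256, 8, 4, 8
logits = paddle.randn([num_tokens, num_experts]).cast("bfloat16")
bias = paddle.randn([num_experts])

# Baseline (A): explicit cast + sigmoid + bias, then the existing noaux_tc routing.
baseline_scores = paddle.nn.functional.sigmoid(logits.cast("float32")) + bias

# Fused (C): single kernel launch, signature assumed:
# fused_weights, fused_ids = grouped_topk(logits, bias, n_group, topk_group, top_k)

# A real test would compare routing ids exactly and weights within a tolerance,
# e.g. paddle.allclose(fused_weights, baseline_weights, rtol=1e-5).
```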
Checklist
- Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a PR targeting the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.